Tone-Group F0 selection for modeling focus prominence in small-footprint speech synthesis
Abstract
This work aims to improve the naturalness of synthetic intonational contours in Text-to-Speech synthesis by providing prominence, a major expressive feature of human speech. Focusing on the tonal dimension of emphasis, we present a robust unit-selection methodology for generating realistic F0 curves in cases where focus prominence is required. The proposed approach is based on selecting Tone-Group units from commonly used prosodic corpora that are automatically transcribed as patterns of syllables. In contrast to related approaches, patterns represent only the most perceivable sections of the sampled curves and are encoded to serve morphologically different sequences of syllables. This minimizes the number of units required to achieve sufficient coverage within the database. This optimization, in turn, enables the application of high-quality F0 generation to small-footprint text-to-speech synthesis. For generic F0 selection we query the database based on sequences of ToBI labels, though other intonational frameworks can be used as well. To realize focus prominence on specific Tone-Groups, the selection also incorporates an indicator of the level of emphasis. We set up a series of listening tests using a database built from a 482-utterance corpus, which featured partially purpose-uttered emphasis. The results showed a clear subjective preference for the proposed model over a linear regression one in 75% of the cases when used in generic synthesis. Furthermore, this model provided an ambiguous percept of emphasis in an experiment featuring major and minor degrees of prominence. © 2006 Elsevier B.V. All rights reserved.
Similar resources
Perceptual equivalence of approximated Cantonese tone contours
This paper describes a perceptual study on approximated Cantonese tone contours. We believe that the perception of tone contours relies mainly on the major trend of pitch movement, and is not sensitive to the exact F0 values at particular time instants. The tone contours of individual syllables and the transition between them are approximated as a small number of linear movements. The effect of...
Generation of Fundamental Frequency Contours of Mandarin in HMM-based Speech Synthesis using Generation Process Model
The HMM-based speech synthesis system can produce high quality synthetic speech with flexible modeling of spectral and prosodic parameters. In this approach, short term spectra, fundamental frequency (F0) and duration are generated by multi-stream HMMs separately. However the quality of synthetic speech degrades when feature vectors used in training are noisy. Among all noisy features, pitch tr...
Generation of fundamental frequency contours for Mandarin speech synthesis based on tone nucleus model
A new method for generating sentence F0 contours of Mandarin speech is proposed. The method assumes the F0 contour generation process model, but generates the tone and phrase components in different ways and sums them to produce a sentence F0 contour. The tone component is generated concatenating F0 patterns of tone nuclei, which are predicted by a corpus-based scheme (binary decision trees). E...
Focus, Lexical Stress and Boundary Tone: Interaction of Three Prosodic Features
This paper studies how focus, lexical stress and rising boundary tone act on F0 of the last preboundary word. We find that when the word is non focused, the rising boundary tone takes control almost from the beginning of the word and flattens F0 peak of the lexical stress. When the word is focused, the rising boundary tone is only dominant after F0 peak of lexical stress is formed. This peak is...
Tone modeling using Gaussian process latent variable model for statistical speech synthesis
In continuous speech of Thai language, tone pronunciation is affected by several factors. One of significant factors is stress that causes a diversity of F0 contours of tone, and affects syllable durations. Our previous studies have shown that a stressed/unstressed syllable context improves tone modeling accuracy. However, the stress in Thai language is generally unknown for a given input text ...
Journal: Speech Communication
Volume 48
Publication date: 2006